Introduction

This document describes the workflow used in the T. brucei gene co-expression network analysis, together with the code used at each step.

Setting up R for the analysis

# set working directory
setwd("analysis/")

# ensure results are reproducible
set.seed(1)

# other settings
options(digits = 4)
options(stringsAsFactors = FALSE)

# loading required R packages
library("dupRadar")
library("Rsubread")
library("limma")
library("edgeR")
library("RColorBrewer")
library("ggplot2")
library("gplots")
library("reshape2")
library("ggfortify")
library("xlsx")
library("WGCNA")
library("flashClust")
library("igraph")
library("tidyverse")

# set number of threads to allow in WGCNA analysis
allowWGCNAThreads(nThreads=20)

Data acquisition

Data used in this study are obtained from the European Nucleotide Archive under accession numbers SRP002243 and SRR965341.

First, metadata for the data is obtained from EBI as follows:

#Obtain metadata information for the data used in this study from ENA and SRA databases.

# ENA metadata
# code adapted from: https://wiki.bits.vib.be/index.php/Download_read_information_and_FASTQ_data_from_the_SRA

accession <- "SRP002243" # similarly, obtain data for SRR965341
ena.url <- paste("http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=",
                 accession,
                 "&result=read_run",
                 "&fields=run_accession,library_name,",
                 "read_count,fastq_ftp,fastq_aspera,",
                 "fastq_galaxy,sra_ftp,sra_aspera,sra_galaxy,",
                 "&download=text",
                 sep="")
ENA.metadata <- read.table(url(ena.url), header=TRUE, sep="\t")

# SRA metadata
SRA.metadata <- read.table("../data/SraRunTable.metadata.txt", header = TRUE, sep = "\t")

# create a text file with urls to fastq files in ENA database
fastq.urls <- ENA.metadata[grepl("fastq_ftp", names(ENA.metadata))]
write.csv(fastq.urls, file="../data/fastq.urls.txt", eol = "\r\n", quote = FALSE, row.names = FALSE)

# obtain sample metadata to be used later in analysis in R.
matches <- c("Run","Library_Name","Sample_Name")
sample.metadata <- SRA.metadata[grepl(paste(matches, collapse="|"), names(SRA.metadata))]

# create a grouping factor that places each sample in one of three tissues, i.e.
# midgut (MG), proventriculus (PV) and salivary glands (SG)
tissue <- factor(c("MG", "MG", "MG", "MG", "MG", "PV", "PV", "SG", "SG", "SG", "SG", 
                   "MG", "MG", "PV", "SG", "SG", "PV"))

# append factor to sample.metadata to group samples
sample.metadata["Tissue"] <- tissue

# The sample below was analysed separately as the reads are paired-end while
# the other samples are single-end.
#
# Add sample from Telleria et al 2014 study (SRR965341) to sample metadata.
sample.metadata$Run <- as.character(sample.metadata$Run)

sample.metadata <- rbind(sample.metadata, "18" = c("SA1", "SRR965341", "SA1", "SG"))

#####################################################################################
# Exclusion of the following samples was done after analysis showed they
# were below required standards

# remove a sample with less than 10M reads from the analysis (SRR039951)
sample.metadata <- sample.metadata[-15,]

# remove 2 samples that have technical duplicates (SRR039937 and SRR039938)
sample.metadata <- sample.metadata[-9,]
sample.metadata <- sample.metadata[-8,]
#####################################################################################
# print out the sample metadata table
sample.metadata
##    Library_Name       Run Sample_Name Tissue
## 1           mg1 SRR039378         MG1     MG
## 2           mg1 SRR039381         MG1     MG
## 3           mg1 SRR039453         MG1     MG
## 4        MG2_SL SRR039454         MG2     MG
## 5        MG2_SL SRR039455         MG2     MG
## 6        PV2_SL SRR039456         PV2     PV
## 7        PV2_SL SRR039457         PV2     PV
## 10       SA2_SL SRR039939         SA2     SG
## 11       SA2_SL SRR039940         SA2     SG
## 12          mg1 SRR039948         MG1     MG
## 13       MG2_SL SRR039949         MG2     MG
## 14       PV2_SL SRR039950         PV2     PV
## 16       SA2_SL SRR039952         SA2     SG
## 17       PV2_SL SRR042429         PV2     PV
## 18          SA1 SRR965341         SA1     SG

Next, the RNASeq data are downloaded from the EBI FTP site.

cat ../scripts/fastq_download.sh
## #!/bin/bash
## #
## #Script to download fastq files from European Nucleotide Archive
## #
## FILE=$1 #File containing fastq url links to EBI FTP site
## 
## OUT_DIR=../data/raw_data/ 
## 
## cat ${FILE} | xargs -n1 wget $2 -P ${OUT_DIR}

Some downstream tools require uncompressed FASTQ files, so the gzipped files are decompressed.

cat ../scripts/unzip.sh
## #!/bin/bash
## #
## #Script to decompress fastq.gz files
## #
## FASTQ_FILES=../data/raw_data/*.fastq.gz
## 
## for file in ${FASTQ_FILES}; do
##     gunzip ${file}
## done

Data quality assessment

After downloading the RNASeq data, read quality is checked with the FastQC tool, which outputs an HTML report for each file.

cat ../scripts/fastqc_reports.sh
## #!/bin/bash
## #
## #Script to run FastQC reports using the FastQC tool
## #
## #load fastqc module
## module load fastqc/0.11.4
## 
## FASTQ_DIR=../data/raw_data/*.fastq
## 
## #create output directory if it doesn't exist.
## mkdir -p ../results/fastqc_reports
## 
## REPORTS_DIR=../results/fastqc_reports/
## 
## for file in ${FASTQ_DIR}; do
##    fastqc -f fastq -o ${REPORTS_DIR} ${file}
## done

FastQC reports a high rate of duplicate reads, so further analysis is done to ascertain their cause, i.e. whether the duplicates arise from PCR artifacts (PCR duplicates) or from biological causes (highly expressed genes). This is done later in the analysis, after read mapping. Below is the sequence duplication level plot.

Downloading T. brucei and G. morsitans genome and annotation files

Genomes are obtained from their respective databases before alignment. The steps to obtain the Glossina morsitans genome and its annotation file are documented here as part of the process; the Glossina genome is used in the analysis for reasons stated in the read mapping section. The genome and annotation files are downloaded from the TriTrypDB and VectorBase databases as follows:

#Downloading T. brucei genome

wget https://tritrypdb.org/common/downloads/release-43/TbruceiTREU927/fasta/data/TriTrypDB-43_TbruceiTREU927_Genome.fasta \
-P ../data/tbrucei_genome/

#Downloading the GFF file
wget https://tritrypdb.org/common/downloads/release-43/TbruceiTREU927/gff/data/TriTrypDB-43_TbruceiTREU927.gff \
-P ../data/tbrucei_genome_annotations_GFF/

# convert the tbrucei gene annotation from GFF format to GTF (required by some downstream tools)
# uses gffread from cufflinks
mkdir -p ../data/tbrucei_genome_annotations_GTF

gffread ../data/tbrucei_genome_annotations_GFF/TriTrypDB-43_TbruceiTREU927.gff \
-T -o ../data/tbrucei_genome_annotations_GTF/TriTrypDB-43_TbruceiTREU927.gtf

# Download T. brucei annotated transcripts (for use in UTR motif discovery) 
wget https://tritrypdb.org/common/downloads/release-43/TbruceiTREU927/fasta/data/TriTrypDB-43_TbruceiTREU927_AnnotatedTranscripts.fasta \
-P ../data/tbrucei_annotated_transcripts/

# Downloading Glossina genome
wget https://www.vectorbase.org/download/glossina-morsitans-yalescaffoldsgmory1fagz \
-P ../data/glossina_genome_scaffolds/

# Downloading GTF file
wget https://www.vectorbase.org/download/glossina-morsitans-yalebasefeaturesgmory19gtfgz \
-P ../data/glossina_genome_annonations_GTF/

Alignment of reads on the genome (Read Mapping)

The first step is indexing the genome using HISAT2, followed by alignment of the reads. The output is SAM files.

Indexing the genome

cat ../scripts/hisat2_index.sh
## #!/bin/bash
## #
## #Script to index genome using HISAT2
## #
## GENOME_FILE=$1
## 
## hisat2-build ${GENOME_FILE} bru-mor_genome_index_hisat2

Aligning the reads to the genome

cat ../scripts/hisat2_align.sh
## #!/bin/bash
## #
## #Script to align reads to the indexed genome using HISAT2
## #
## for fastq in ../data/raw_data/*.fastq; do
##     fqname="${fastq%.fastq}" # strip the extension (cut -d '.' would split on the dots in the relative path)
## 
##      hisat2 \
##       -x bru-mor_genome_index_hisat2 \
##       -U ${fastq} \
##       -S ${fqname}.sam \
##       -p 8 \
##      --summary-file ${fqname}.txt \
##      --new-summary
## done
## 
## #move the output sam files to a new directory
## 
## #create directory if not exists
## mkdir -p ../data/processed_data/bru-mor_sam
## 
## #move the sam files
## mv ../data/raw_data/*.sam ../data/processed_data/bru-mor_sam/

Following the high rate of unaligned reads, further analysis is done to ascertain their cause. The reads are aligned to the Glossina morsitans genome to determine whether reads from the vector were also present in the samples during sequencing (dual RNA-Seq). The T. brucei and G. morsitans genome files are concatenated into a single fasta file, which is used during alignment of the reads; this also ensures no cross-mapping of reads takes place.

# make a directory to store the concatenated genomes
mkdir -p ../data/brucei-morsitans

# copy the genome files to the created directory and concatenate them
cp ../data/glossina_genome_scaffolds/glossina-* ../data/brucei-morsitans/
cp ../data/tbrucei_genome/*.fasta ../data/brucei-morsitans/
cat ../data/brucei-morsitans/*.fa* > ../data/brucei-morsitans/brucei-morsitans_genomes.fasta

HISAT2 is used to re-align the reads, beginning with indexing the concatenated genomes; the hisat2_index.sh and hisat2_align.sh scripts above are reused with the combined fasta file. Below are the alignment results from the HISAT2 alignment tool.

Assessment of the duplication rate

At this point, quality control to assess the duplication rate can be performed. First, the SAM files are converted to sorted BAM files, as required by the dupRadar tool.

cat ../scripts/sam-to-bam.sh
## #!/bin/bash
## #
## #Script to convert sam files to bam files
## #
## for sam_file in ../data/processed_data/bru-mor_sam/*.sam; do
##  sam_file_name="${sam_file%.sam}"
##      samtools view -S -b $sam_file > ${sam_file_name}.bam
## done
## 
## # move the created bam files to a new directory
## mkdir -p ../data/processed_data/bru-mor_bam
## 
## mv ../data/processed_data/bru-mor_sam/*.bam ../data/processed_data/bru-mor_bam/

The BAM files are then sorted using samtools.

cat ../scripts/sort_bam.sh
## #!/bin/bash
## #
## #Script to sort BAM file
## #
## for bam_file in ../data/processed_data/bru-mor_bam/*.bam; do
##     bam_file_name="${bam_file%.bam}"
##      samtools sort $bam_file -o ${bam_file_name}.sorted.bam
## done

Next, duplicates are marked in the BAM files using Picard.

cat ../scripts/mark_dupes.sh
## #!/bin/bash
## #
## #Script to mark duplicates in BAM files using picard
## #
## for bam_file in ../data/processed_data/bru-mor_bam/*.sorted.bam; do
##     bam_file_name="${bam_file%.sorted.bam}"
## 
##      /opt/apps/picard-tools/1.119/bin/MarkDuplicates \
##          I=$bam_file \
##          O=${bam_file_name}.dupMarked.bam \
##          M=${bam_file_name}.dupMetrics.txt
## done

At this point, the dupRadar tool is used to perform quality control in R. Before this can be done, we need to verify whether the reads are stranded, as strandedness is a required parameter for dupRadar as well as for the HTSeq tool later in the analysis. This is done with the RSeQC package (an RNA-seq quality control package), using its infer_experiment.py module.

First, we convert the T. brucei GTF genome annotation into the BED format required by RSeQC, then run infer_experiment.py on a few samples to verify strandedness.

# convert GTF genome annotation to BED format using a custom script from:
#https://github.com/ExpressionAnalysis/ea-utils/tree/master/clipper

../scripts/gtf2bed.pl ../data/tbrucei_genome_annotations_GTF/TriTrypDB-43_TbruceiTREU927.gtf > \
../data/tbrucei_genome_annotations_GTF/TriTrypDB-43_TbruceiTREU927.bed

infer_experiment.py -i ../data/processed_data/bru-mor_bam/SRR039381.bam \
-r ../data/tbrucei_genome_annotations_GTF/TriTrypDB-43_TbruceiTREU927.bed

# output
#This is SingleEnd Data
#Fraction of reads failed to determine: 0.0008
#Fraction of reads explained by "++,--": 0.3832
#Fraction of reads explained by "+-,-+": 0.6161

The next step is to run the dupRadar quality control analysis with the stranded parameter set to 0 (unstranded), as the reads are not strand-specific.

# Parameters
bam_file <- "../data/processed_data/bru-mor_bam/SRR039951.dupMarked.bam"
gtf_file <- "../data/brucei-morsitans/brucei-morsitans_annotations.gtf"
stranded <- 0
paired <- FALSE
threads <- 12

# Duplication rate analysis
dm <- analyzeDuprates(bam_file, gtf_file, stranded, paired, threads)

#Plots
png(filename = "../figures/duplication_rate/SRR039951.png")
duprateExpDensPlot(DupMat=dm)
title("SRR039951")
dev.off()

# Boxplot
duprateExpBoxplot(DupMat=dm)

Below are representative plots of samples with and without PCR duplicates. Samples with PCR duplicates are removed from the analysis before the counts data are read into R.

Sample without PCR duplicates

Sample with PCR duplicates

Reads quantification

The HTSeq tool is used to count reads that aligned to the T. brucei genome. Since the T. brucei annotation file is used, HTSeq does not count G. morsitans reads that aligned to the Glossina genome. The output is a text file for each sample containing the number of reads counted for each gene.

cat ../scripts/htseq_counts.sh
## #!/bin/bash
## #
## #Script to counts the number of reads aligned to T. brucei genome using HTSeq.
## #Resource: HTSeq documentation https://htseq.readthedocs.io/en/latest/count.html
## #
## module load htseq/0.11.2
## 
## #create output directory if it doesn't exist
## mkdir -p ../results/brucei_HTSeq_count_results
## 
## GFF_FILE=$1
## 
## for bam_file in ../data/processed_data/bru-mor_bam/*.bam; do
##     bam_file_name=$(basename "$bam_file" .bam)
##     
##         python /opt/apps/htseq/0.11.2/bin/htseq-count \
##             -f bam \
##             -s no \
##             -t exon \
##             -i Parent \
##             $bam_file \
##             $GFF_FILE \
##             > ../results/brucei_HTSeq_count_results/${bam_file_name}.counts.txt
## done

Below is a graphical visualization of reads assignment by HTSeq.

HTSeq reads assignment

Generating MultiQC report

MultiQC aggregates the results from the FastQC, HISAT2 and HTSeq analyses into a single HTML report for better visualization.

#change directory to results
cd ../results

#Run multiqc
multiqc .

# create a directory for the multiQC report and move the output there.
mkdir -p brucei_multiqc_report
mv multiqc* brucei_multiqc_report/

Filtering out non-protein coding genes

Before loading the data into R, non-protein-coding genes are filtered out; these include ncRNA, snRNA, snoRNA, pseudogenic transcripts, rRNA and tRNA.

cat ../scripts/exclude_features.sh
## #!/bin/bash
## # script to exclude gene features from reads counts 
## 
## # create directory for the filtered counts
## mkdir -p ../results/brucei_HTSeq_count_results_mRNA
## 
## for file in ../results/brucei_HTSeq_count_results/*.counts.txt; do
##   counts_file=$(basename "$file" .counts.txt)
##      grep -v -f ../results/excluded_features.txt ${file} > \
##          ../results/brucei_HTSeq_count_results_mRNA/"${counts_file}".counts.txt
## done
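The excluded_features.txt list used by the grep above is not generated anywhere in this workflow. Below is a minimal sketch of how it could be built from the T. brucei GFF; it assumes the non-coding feature types appear in column 3 of the GFF and that gene IDs are carried in an ID= attribute in column 9, so the pattern matching is illustrative rather than the original code.

```r
# Sketch: build excluded_features.txt (IDs of non-protein-coding features)
# from the T. brucei GFF. Assumes feature types in column 3 and ID= tags in
# column 9; adjust if the actual annotation differs.
gff <- read.delim("../data/tbrucei_genome_annotations_GFF/TriTrypDB-43_TbruceiTREU927.gff",
                  header = FALSE, comment.char = "#", stringsAsFactors = FALSE)
noncoding <- c("ncRNA", "snRNA", "snoRNA", "pseudogenic_transcript", "rRNA", "tRNA")
ids <- sub(".*ID=([^;]+).*", "\\1", gff$V9[gff$V3 %in% noncoding])
writeLines(unique(ids), "../results/excluded_features.txt")
```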

Analysis in R

Importing samples count data into R

For further analysis, the per-sample read counts are read into R. To do this using the script below, type source("../scripts/htseq-combine_all.R") at the R console and hit enter. Ensure that the samples excluded from the analysis (SRR039951, SRR039937 and SRR039938) are not among the input files.

cat ../scripts/htseq-combine_all.R
## #!/usr/bin/Rscript
## 
## # Take 'all' htseq-count results and melt them in to one big dataframe
## 
## #Adapted from: https://wiki.bits.vib.be/index.php/NGS_RNASeq_DE_Exercise.4
## 
## # where are we?
## basedir <- "../results"
## setwd(basedir)
## 
## cntdir <- paste(basedir, "brucei_HTSeq_count_results_mRNA", sep="/")
## pat <- ".counts.txt"
## hisat2.all <- list.files(path = cntdir,
##                          pattern = pat,
##                          all.files = TRUE,
##                          recursive = FALSE,
##                          ignore.case = FALSE,
##                          include.dirs = FALSE)
## 
## # we choose the 'all' series
## myfiles <- hisat2.all
## DT <- list()
## 
## # read each file as array element of DT and rename the last 2 cols
## # we created a list of single sample tables
## for (i in 1:length(myfiles) ) {
##   infile = paste(cntdir, myfiles[i], sep = "/")
##   DT[[myfiles[i]]] <- read.table(infile, header = F, stringsAsFactors = FALSE)
##   cnts <- gsub("(.*).counts.txt", "\\1", myfiles[i])
##   colnames(DT[[myfiles[i]]]) <- c("ID", cnts)
## }
## 
## # merge all elements based on first ID columns
## data <- DT[[myfiles[1]]]
## 
## # inspect
## #head(data)
## 
## # we now add each other table with the ID column as key
## for (i in 2:length(myfiles)) {
##   y <- DT[[myfiles[i]]]
##   z <- merge(data, y, by = c("ID"))
##   data <- z
## }
## 
## # ID column becomes rownames
## rownames(data) <- data$ID
## data <- data[,-1]
## 
## ## add total counts per sample
## data <- rbind(data, tot.counts=colSums(data))
## 
## # inspect and look at the top row names!
## #head(data)
## 
## #tail(data)
## 
## ####################################
## # take summary rows to a new table
## # ( not starting with Tb and tmp with invert=TRUE )
## 
## # transpose table for readability
## data.all.summary <- data[grep("^Tb|^tmp", rownames(data), perl=TRUE, invert=TRUE), ]
## 
## # review
## #data.all.summary
## 
## # transpose table
## t(data.all.summary)
## 
## # write summary to file
## write.csv(data.all.summary, file = "brucei_htseq_counts_all-summary.csv")
## 
## ####################################
## # take all data rows to a new table
## 
## data.all <- data[grep("^Tb|^tmp", rownames(data), perl=TRUE, invert=FALSE), ]
## 
## # inspect final merged table
## #head(data.all, 3)
## 
## # write data to file
## write.table(data.all, file = "brucei_htseq_counts_all.txt", quote = FALSE, sep = "\t")
## 
## # cleanup intermediate objects
## rm(y, z, i, DT)
## 
## #return to analysis directory
## setwd("../analysis")

Sample quality check

Sample quality is assessed before further analysis to check for outliers and batch effects.

# Remove extra column from sample SRR965341 added while reading data into R
data.all["Var.2"] <- NULL

# Create a DGEList object
counts <- DGEList(data.all, group = sample.metadata$Tissue)

# check the number of genes with no expression in all samples
table(rowSums(counts$counts==0)==15)
#FALSE  TRUE 
# 9184   792
##################################################################
# superseded: the filterByExpr() alternative below is used instead of this block
#
# filter out Non-expressed genes and those with low expression values.
# keep genes with at least 1 read per million in n samples, where n 
# is minimum number of replicates which is 3 in this case
keep <- rowSums(cpm(counts)>1) >= 3

# retain genes above the cpm value
#filtered.counts <- counts[keep, , keep.lib.sizes=FALSE]
##################################################################

# Filtering non-expressed and lowly-expressed genes.
#
# Alternative filtering function that filters better (Law et al 2016)
# retain genes above a calculated cpm value
keep.exprs <- filterByExpr(counts, group=sample.metadata$Sample_Name)
filtered.counts <- counts[keep.exprs,, keep.lib.sizes=FALSE]

# obtain logCPM unnormalized for plotting purposes.
# Here, the norm.factors value is 1 for all samples
logcpm.unnorm.counts <- cpm(filtered.counts, log = TRUE, prior.count = 2, normalized.lib.sizes = TRUE)

# Normalize for composition bias using TMM
filtered.counts <- calcNormFactors(filtered.counts, method = 'TMM')

# Convert counts per million per gene to log counts per million for further downstream analysis.
logcpm.norm.counts <- cpm(filtered.counts, log = TRUE, prior.count = 2, normalized.lib.sizes = TRUE)

Various plots are made for the samples before and after normalization.

Samples heatmap

# Sample_Name is a character column (stringsAsFactors = FALSE), so convert it
# to a factor before counting levels and assigning colours
sample_category <- nlevels(factor(sample.metadata$Sample_Name))
colour.palette <- colorRampPalette(brewer.pal(sample_category, "Set2"))(sample_category)
sample.colours <- colour.palette[as.integer(factor(sample.metadata$Sample_Name))]

# Unnormalized sample heatmap
png(filename = "../figures/unnorm_sample_heatmap.png", res =1200, type = "cairo", units = 'in',
    width = 5, height = 4, pointsize = 10)
heatmap.2(cor(logcpm.unnorm.counts), RowSideColors=sample.colours, trace='none', 
          main='Sample correlations')
dev.off()

# Normalized sample heatmap
png(filename = "../figures/norm_sample_heatmap.png", res =1200, type = "cairo", units = 'in',
    width = 5, height = 4, pointsize = 10)
heatmap.2(cor(logcpm.norm.counts), RowSideColors=sample.colours, trace='none', 
          main='Sample correlations')

dev.off()

Filtered-Unnormalized sample correlation heatmap

Normalized sample correlation heatmap

Samples density plot

# Checking further using sample density plot

# raw data
log.counts <- log2(counts$counts + 1)
png("../figures/raw_sample_density.png", res =1200, type = "cairo", units = 'in',
   width = 6, height = 6, pointsize = 10)
x <- melt(as.matrix(log.counts))

colnames(x) <- c('gene_id', 'sample', 'log')
ggplot(x, aes(x=log, color=sample)) + geom_density()
dev.off()

# filtered and unnormalized sample data
png("../figures/unnorm_sample_density.png", res =1200, type = "cairo", units = 'in',
    width = 6, height = 4, pointsize = 10)
x <- melt(as.matrix(logcpm.unnorm.counts))

colnames(x) <- c('gene_id', 'sample', 'logcpm')
ggplot(x, aes(x=logcpm, color=sample)) + geom_density()

dev.off()

# filtered and normalized sample data
png("../figures/norm_sample_density.png", res =1200, type = "cairo", units = 'in',
    width = 6, height = 4, pointsize = 10)
x <- melt(as.matrix(logcpm.norm.counts))

colnames(x) <- c('gene_id', 'sample', 'logcpm')
ggplot(x, aes(x=logcpm, color=sample)) + geom_density()

dev.off()

Raw sample density plot

Normalized sample density plot

Principal component analysis

# PCA

# raw samples PCA
pca.log.counts <- prcomp(t(log.counts)) # raw data (unnormalized and unfiltered)
png(filename = "../figures/raw_samples_PCA.png", res =1200, type = "cairo", units = 'in',
    width = 6, height = 4, pointsize = 10)
autoplot(pca.log.counts,
         data = sample.metadata,
         colour="Sample_Name",
         size=3)
dev.off()

# unnormalized samples PCA
pca.log.counts <- prcomp(t(logcpm.unnorm.counts))  #unnormalized & filtered
png(filename = "../figures/unnorm_sample_PCA.png", res =1200, type = "cairo", units = 'in',
    width = 6, height = 4, pointsize = 10)
autoplot(pca.log.counts,
         data = sample.metadata,
         colour="Sample_Name",
         size=3)
dev.off()


# normalized samples PCA
pca.log.counts <- prcomp(t(logcpm.norm.counts))  #normalized & filtered
png(filename = "../figures/norm_sample_PCA.png", res =1200, type = "cairo", units = 'in',
    width = 6, height = 4, pointsize = 10)
autoplot(pca.log.counts,
         data = sample.metadata,
         colour="Sample_Name",
         size=3)
dev.off()

Raw Sample PCA

Filtered-Unnormalized sample PCA

Normalized sample PCA

Boxplot

# raw sample boxplot
png(filename = "../figures/raw_sample_boxplot.png", res =1200, type = "cairo", units = 'in',
    width = 4, height = 4, pointsize = 6)
y <- melt(as.matrix(log.counts))

colnames(y) <- c('gene_id', 'sample', 'log')
ggplot(y, aes(x=sample, y=log)) + geom_boxplot() + 
  theme(axis.text.x  = element_text(angle=90, vjust=0.5))
dev.off()

# unnormalized sample boxplot
png(filename = "../figures/unnorm_sample_boxplot.png", res =1200, type = "cairo", units = 'in',
    width = 4, height = 4, pointsize = 6)
y <- melt(as.matrix(logcpm.unnorm.counts))

colnames(y) <- c('gene_id', 'sample', 'logcpm')
ggplot(y, aes(x=sample, y=logcpm)) + geom_boxplot() + 
  theme(axis.text.x  = element_text(angle=90, vjust=0.5))
dev.off()

# normalized sample boxplot
png(filename = "../figures/norm_sample_boxplot.png", res =1200, type = "cairo", units = 'in',
    width = 4, height = 4, pointsize = 6)
y <- melt(as.matrix(logcpm.norm.counts))

colnames(y) <- c('gene_id', 'sample', 'logcpm')
ggplot(y, aes(x=sample, y=logcpm)) + geom_boxplot() + 
  theme(axis.text.x  = element_text(angle=90, vjust=0.5))
dev.off()

Filtered-Unnormalized sample boxplot

Normalized sample boxplot

Identify differentially expressed genes

# Apply sample grouping based on Tissue from which the sample was derived
design <- model.matrix(~0+sample.metadata$Tissue)
colnames(design) <- levels(sample.metadata$Tissue)

# Estimate dispersions for tags
filtered.counts <- estimateDisp(filtered.counts, design, robust = TRUE)

# Fit a generalized likelihood model to the DGELIST using sample grouping
fit <- glmFit(filtered.counts,design)

#################################################################
# code in this section adapted from https://github.com/iscb-dc-rsg/2016-summer-workshop
# generate a list of all possible pairwise contrasts
condition_pairs <- t(combn(levels(sample.metadata$Tissue), 2))

comparisons <- list()
for (i in 1:nrow(condition_pairs)) {
  comparisons[[i]] <- as.character(condition_pairs[i,])
}

# remove MG to SG comparison
comparisons[[2]] <- NULL

# vector to store differentially expressed genes
sig_genes <- c()

# iterate over the contrasts, and perform a differential expression test for
# each pair
for (conds in comparisons) {
    # generate string contrast formula
    contrast_formula <- paste(conds, collapse=' - ')

    contrast_mat <- makeContrasts(contrasts=contrast_formula, levels=design)
    contrast_lrt <- glmLRT(fit, contrast=contrast_mat)
    topGenes <- topTags(contrast_lrt, n=Inf, p.value=0.05, adjust.method = "BH")
    
    # Grab highly ranked genes
    sig_genes <- union(sig_genes, rownames(topGenes$table))
}

# Filter out genes which were not differentially expressed for any contrast
de.genes <- filtered.counts[rownames(filtered.counts) %in% sig_genes,]
dim(de.genes$counts)
#4188   15
################################################################

# Obtain the counts of genes expressed for each contrast individually
# This aims to obtain the number of genes differentially expressed between 
# the 3 stages of development i.e. MG -> PV, PV -> SG

# Likelihood ratio test to identify DEGs
# SG compared to PV
SG_vs_PV_lrt <- glmLRT(fit, contrast=c(0,-1,1))

# PV compared to MG
PV_vs_MG_lrt <- glmLRT(fit, contrast = c(-1,1,0))


# Genes with most significant differences (using topTags)
# SG compared to PV
topGenes_SG <- topTags(SG_vs_PV_lrt, adjust.method = "BH", p.value = 0.05, n=Inf)
dim(topGenes_SG)
#2973    5

# PV compared to MG
topGenes_PV <- topTags(PV_vs_MG_lrt, adjust.method = "BH", p.value = 0.05, n=Inf)
dim(topGenes_PV)
#2543    5

#Total number of genes: 5516
#######################################################################################
# DE genes at 5% FDR (using decideTestsDGE function)
#
# SG compared to PV
SG_vs_PV_de.genes <- decideTestsDGE(SG_vs_PV_lrt, adjust.method = "BH", p.value = 0.05)

# get summary
summary(SG_vs_PV_de.genes)
#       -1*PV 1*SG
#Down         1375
#NotSig       4417
#Up           1598

# PV compared to MG
PV_vs_MG_de.genes <- decideTestsDGE(PV_vs_MG_lrt, adjust.method = "BH", p.value = 0.05)

# summary
summary(PV_vs_MG_de.genes)
#       -1*MG 1*PV
#Down         1362
#NotSig       4847
#Up           1181

# DE genes in the PV that are common in both comparisons
de.common <- which(PV_vs_MG_de.genes!=0 & SG_vs_PV_de.genes!=0)
length(de.common)
#1328

# create a dataframe with data on PV and SG differential gene expression
PV_data <- topGenes_PV$table
SG_data <- topGenes_SG$table

# append status of regulation for each gene (upregulated = 1, downregulated = -1)
# Note: subsetting with %in% returns rows in decideTestsDGE order rather than
# topTags order, which mismatches the status and the logFC sign; indexing by
# rowname keeps the rows aligned.
PV_data$RegulationStatus <- PV_vs_MG_de.genes[rownames(PV_data),]
SG_data$RegulationStatus <- SG_vs_PV_de.genes[rownames(SG_data),]

# obtain upregulated and downregulated genes and write out to excel
PV_DownReg <- PV_data[order(PV_data$logFC),]
PV_UpReg <- PV_data[order(PV_data$logFC, decreasing = TRUE),]

SG_DownReg <- SG_data[order(SG_data$logFC),]
SG_UpReg <- SG_data[order(SG_data$logFC, decreasing = TRUE),]

write.xlsx(PV_DownReg, file = "../results/Significant_differentially_expressed_genes.xlsx",
           sheetName = "Down-regulated Proventriculus")
write.xlsx(PV_UpReg, file = "../results/Significant_differentially_expressed_genes.xlsx",
           sheetName = "Up-regulated Proventriculus", append = TRUE)
write.xlsx(SG_DownReg, file = "../results/Significant_differentially_expressed_genes.xlsx",
           sheetName = "Down-regulated Salivary glands", append = TRUE)
write.xlsx(SG_UpReg, file = "../results/Significant_differentially_expressed_genes.xlsx",
           sheetName = "Up-regulated Salivary glands", append = TRUE)

Plotting to visually inspect differential gene expression results.

# Differential expression analysis - plots
#
# Volcano plots

# table with -log10-transformed p-values and logFC.
# Run one of the two assignments below depending on the comparison being
# plotted (PV vs MG, or SG vs PV); each overwrites 'tab'.
tab <- data.frame(logFC = topGenes_PV$table$logFC, 
                  negLogPval = -log10(topGenes_PV$table$PValue)) # PV compared to MG

tab <- data.frame(logFC = topGenes_SG$table$logFC, 
                  negLogPval = -log10(topGenes_SG$table$PValue)) # SG compared to PV

# generating the plot
png("../figures/PV-MG_DEG_volcanoplot.png", res =1200, type = "cairo", units = 'in',
    width = 4, height = 4, pointsize = 8)
par(mar = c(5, 4, 4, 4))
plot(tab, pch = 16, cex = 0.6, xlab = expression(log[2]~fold~change),
ylab = expression(-log[10]~pvalue), main = "PV vs MG differentially expressed genes")

## Log2 fold change and p-value cutoffs
lfc <- 2
pval <- 0.05

## Select interesting genes
signGenes <- (abs(tab$logFC) > lfc & tab$negLogPval > -log10(pval))

## Identifying the selected genes
points(tab[signGenes, ], pch = 16, cex = 0.8, col = "red")
abline(h = -log10(pval), col = "green3", lty = 2)
abline(v = c(-lfc, lfc), col = "blue", lty = 2)
mtext(paste("pval =", pval), side = 4, at = -log10(pval), cex = 0.8, line = 0.5, las = 1)
mtext(c(paste("-", lfc, "fold"), paste("+", lfc, "fold")), side = 3, at = c(-lfc, lfc),
cex = 0.8, line = 0.5)

dev.off()

# create a venn diagram to show distribution of the number DEGs between stages
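The venn diagram code itself does not appear in this chunk. A minimal sketch using limma's vennDiagram() on the two decideTestsDGE results created above might look like the following; the figure path is illustrative.

```r
# The two decideTestsDGE results come from the same fitted model, so their
# rows (genes) are in the same order and can be combined column-wise.
de.results <- cbind(PV_vs_MG_de.genes, SG_vs_PV_de.genes)
colnames(de.results) <- c("PV vs MG", "SG vs PV")

png(filename = "../figures/DEG_venn_diagram.png", res = 1200, type = "cairo",
    units = 'in', width = 4, height = 4, pointsize = 8)
vennDiagram(de.results, include = c("up", "down"),
            counts.col = c("red", "blue"), circle.col = c("salmon", "lightblue"))
dev.off()
```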
Differentially expressed genes in the proventriculus compared to midgut

Differentially expressed genes in the salivary gland compared to proventriculus

Weighted gene co-expression analysis

# obtain the required counts data (WGCNA input)
# WGCNA requires genes to be in columns
network.counts <- t(logcpm.norm.counts)

# determine the soft-thresholding power to use
powers <- c(c(1:10), seq(from = 12, to=20, by=2))
sft <- pickSoftThreshold(network.counts, powerVector = powers, verbose = 5)

# results
#sizeGrWindow(9, 5)
par(mfrow = c(1,2))

cex1 <- 0.9

# Scale-free topology fit index as a function of the soft-thresholding power
png(filename = "../figures/soft-thresholding_power.png", res =1200, type = "cairo", units = 'in',
    width = 4, height = 4, pointsize = 10)
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], 
     xlab="Soft Threshold (power)", 
     ylab="Scale Free Topology Model Fit,signed R^2",type="n", 
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");

# The red line corresponds to using an R^2 cut-off of 0.80
abline(h=0.80,col="red")
dev.off()

# Mean connectivity as a function of the soft-thresholding power
png(filename = "../figures/mean_connectivity.png", res =1200, type = "cairo", units = 'in',
    width = 4, height = 5, pointsize = 10)
plot(sft$fitIndices[,1], sft$fitIndices[,5], 
     xlab="Soft Threshold (power)", 
     ylab="Mean Connectivity", type="n", 
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
dev.off()

Plots to determine the soft thresholding power to use.

########################################################################
## Constructing the network
########################################################################

# construct adjacency matrix
softpower <- 12
adjacency.matrix <- adjacency(network.counts, power=softpower,
                                             type = "signed", corFnc = "cor")

#set diagonal to 0 to remove uninformative correlations
diag(adjacency.matrix) <- 0

# Turn the adjacency matrix into a topological overlap matrix (TOM) to minimize
# the effects of noise and spurious associations
TOM <- TOMsimilarity(adjacency.matrix, TOMType = "signed")
dissTOM <- 1 - TOM

# remove adjacency matrix and TOM to free up memory
rm(TOM)
rm(adjacency.matrix)
gc()

# Adjacency matrix heatmap plot / network heatmap of selected genes
heatmap_indices <- sample(nrow(dissTOM), 500) # sub-sample for visualization purposes

png(filename = "../figures/adjacency_matrix_heatmap.png", res =1200, type = "cairo", units = 'in',
    width = 5, height = 5, pointsize = 10)
heatmap.2(t(dissTOM[heatmap_indices, heatmap_indices]),
            col=redgreen(75),
            labRow=NA, labCol=NA, 
            trace='none', dendrogram='row',
            xlab='Gene', ylab='Gene',
            main='Adjacency matrix',
            density.info='none', revC=TRUE)
dev.off()
Adjacency matrix heatmap (500 genes)

################################################################
## Detecting co-expression modules in R
################################################################

# view the dendrogram based on hierarchical clustering of genes
gene.tree <- flashClust(as.dist(dissTOM), method = "average")

# plot the gene tree
png(filename = "../figures/gene_tree.png", res =1200, type = "cairo", units = 'in',
    width = 7, height = 8, pointsize = 10)
#sizeGrWindow(12,9) #open graphical window
plot(gene.tree, xlab="", sub="", main = "Gene clustering based on TOM dissimilarity", 
     labels = FALSE, hang = 0.04)
dev.off()

# identify the modules
module.labels <- cutreeDynamicTree(gene.tree, deepSplit = FALSE, 
                                   minModuleSize = 30)

#view
table(module.labels)

# convert labels to colours
module.colours <- labels2colors(module.labels)

# view
table(module.colours)

# a list of 35 modules

         black           blue          brown           cyan      darkgreen 
           333            611            495            219            118 
      darkgrey    darkmagenta darkolivegreen     darkorange        darkred 
           109             33             44            103            130 
 darkturquoise          green    greenyellow           grey         grey60 
           111            341            272             19            155 
     lightcyan     lightgreen    lightyellow        magenta   midnightblue 
           193            155            143            290            210 
        orange  paleturquoise           pink         purple            red 
           106             73            291            282            340 
     royalblue    saddlebrown         salmon        skyblue      steelblue 
           136             85            234             99             80 
           tan      turquoise         violet          white         yellow 
           265            693             68            101            453

# visualize the gene tree and TOM matrix together using a TOM plot
# if necessary, raise dissTOM to a power to make moderately strong connections more visible in the heatmap
png(filename = "../figures/gene_tree_and_TOM.png", res =1200, type = "cairo", units = 'in',
    width = 5, height = 6, pointsize = 10)
TOMplot(dissTOM, gene.tree, as.character(module.colours))
dev.off()


# plot gene dendrogram
png(filename = "../figures/gene_tree_and_colours.png", res =1200, type = "cairo", units = 'in',
    width = 5, height = 6, pointsize = 10)
#sizeGrWindow(8,6) #open graphical window
plotDendroAndColors(gene.tree, module.colours, "Dynamic Tree Cut", dendroLabels = FALSE,
                    hang = 0.03, addGuide = TRUE, guideHang = 0.05,
                    main = "Gene dendrogram and module colours")
dev.off()

# get hub genes
module.hub.genes <- chooseTopHubInEachModule(network.counts, module.colours, 
                                             power = 12, type = "signed") # use the soft-thresholding power

# A list of module hub genes
                black                  blue                 brown 
  "Tb10.v4.0128:mRNA"   "Tb927.1.3130:mRNA"   "Tb927.3.1630:mRNA" 
                 cyan             darkgreen              darkgrey 
 "Tb927.11.1450:mRNA"  "Tb927.11.3260:mRNA"  "Tb927.9.14480:mRNA" 
          darkmagenta        darkolivegreen            darkorange 
  "Tb927.3.3730:mRNA"  "Tb927.11.1570:mRNA"   "Tb927.6.2960:mRNA" 
              darkred         darkturquoise                 green 
  "Tb927.3.4000:mRNA"    "Tb927.7.920:mRNA"   "Tb927.2.5810:mRNA" 
          greenyellow                grey60             lightcyan 
  "Tb927.6.3880:mRNA" "Tb927.10.14760:mRNA"   "Tb927.11.900:mRNA" 
           lightgreen           lightyellow               magenta 
  "Tb927.8.5550:mRNA" "Tb927.10.10770:mRNA" "Tb927.10.12780:mRNA" 
         midnightblue                orange         paleturquoise 
  "Tb927.9.6290:mRNA"  "Tb927.9.10100:mRNA"   "Tb927.9.5690:mRNA" 
                 pink                purple                   red 
  "Tb927.8.1650:mRNA"   "Tb927.9.5960:mRNA"   "Tb927.5.3130:mRNA" 
            royalblue           saddlebrown                salmon 
  "Tb927.6.3150:mRNA"   "Tb927.9.2600:mRNA"  "Tb927.11.5450:mRNA" 
              skyblue             steelblue                   tan 
  "Tb927.6.2790:mRNA"   "Tb927.7.2500:mRNA"   "Tb927.3.1380:mRNA" 
            turquoise                violet                 white 
  "Tb927.9.5240:mRNA"   "Tb927.3.3270:mRNA"   "Tb927.3.1370:mRNA" 
               yellow 
  "Tb927.7.7200:mRNA"
Gene tree and module colours

The section below is included for additional checks; it is not strictly necessary for the main analysis.

# --------------------------------------------------------------------------------------------
# merge modules with very similar expression profiles as their genes are highly co-expressed
# get the module eigengenes
module.eigengenes <- moduleEigengenes(network.counts, colors = module.colours)$eigengenes

# calculate dissimilarity of module eigengenes using correlations
module.eigengenes.diss <- 1 - cor(module.eigengenes)

# cluster module eigengenes
module.eigengenes.tree <- flashClust(as.dist(module.eigengenes.diss), method = "average")

# choose height at which to cut the tree for merge i.e. the threshold
module.eigengenes.thresh <- 0.25

# create plots for the results
png(filename = "../figures/module_eigengenes_cluster.png", res =1200, type = "cairo", units = 'in',
    width = 5, height = 6, pointsize = 10)
#sizeGrWindow(7, 6)
plot(module.eigengenes.tree, main = "Clustering of module eigengenes", xlab = "", sub = "")
abline(h=module.eigengenes.thresh, col="red")

dev.off()

# merge the modules
module.eigengenes.merge <- mergeCloseModules(network.counts, module.colours, 
                                             cutHeight = module.eigengenes.thresh)

# merged module colours
merged.module.colours <- module.eigengenes.merge$colors

# view
table(merged.module.colours)

# a list of 14 modules

         black           cyan      darkgreen       darkgrey darkolivegreen 
           828            583            229            109            780 
         green    greenyellow           grey         grey60  paleturquoise 
           794            802             19            370             73 
          pink      royalblue      steelblue          white 
          1771            501             80            451

# eigengenes of new merged modules
merged.module.eigengenes <- module.eigengenes.merge$newMEs

# plot the dendrogram with original and merged colours underneath
#sizeGrWindow(12, 9)
png(filename = "../figures/merged-original_colours-original_dendro.png", res =1200, type = "cairo", 
    units = 'in', width = 5, height = 6, pointsize = 10)
plotDendroAndColors(gene.tree, cbind(module.colours, merged.module.colours), 
                    c("Dynamic Tree Cut", "Merged dynamic"), 
                    dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05)
dev.off()

# plot heatmap of eigengenes (original, before merging)
png(filename = "../figures/eigengenes_heatmap.png", res =1200, type = "cairo", units = 'in',
    width = 5, height = 6, pointsize = 10)
plotEigengeneNetworks(module.eigengenes, "Eigengenes heatmap", marHeatmap = c(3,4,2,2),
                      plotDendrograms = FALSE, xLabelsAngle = 90)
dev.off()

#-----------------------------------------------------------------------------------------
# rename some variables based on the module eigengene analysis for later use
#
# module colours
#module.colours <- merged.module.colours

# construct numerical labels corresponding to the colours
colorOrder <- c("grey", standardColors(50))
module.labels <- match(module.colours, colorOrder)-1

# module eigengenes
#module.eigengenes <- merged.module.eigengenes

# get hub genes
merged.module.hub.genes <- chooseTopHubInEachModule(network.counts,
                                                    merged.module.colours,
                                                    power = 12,
                                                    type = "signed") # use the soft-thresholding power

# a list of merged module hub genes
                 black                   cyan              darkgreen 
"Tb927.11.1690.2:mRNA"    "Tb927.4.1390:mRNA"  "Tb927.10.12710:mRNA" 
              darkgrey         darkolivegreen                  green 
  "Tb927.9.14480:mRNA"    "Tb927.4.4730:mRNA"    "Tb927.11.760:mRNA" 
           greenyellow                 grey60          paleturquoise 
 "Tb927.11.13610:mRNA"  "Tb927.10.14760:mRNA"    "Tb927.9.5690:mRNA" 
                  pink              royalblue              steelblue 
    "Tb927.5.630:mRNA"  "Tb927.10.12750:mRNA"    "Tb927.7.2500:mRNA" 
                 white 
    "Tb927.1.580:mRNA"
##############################################################################
## Network export to cytoscape
##############################################################################

# select modules of interest
interesting.modules <- c('black', 'cyan', 'grey','brown',
                         'midnightblue','blue','darkgreen','tan', 'darkgrey','darkorange',
                         'darkred','darkturquoise','green','greenyellow','grey60','lightcyan',
                         'lightgreen','lightyellow','magenta','orange','pink','purple','red',
                         'royalblue','salmon','skyblue','turquoise','white','yellow', 
                         'saddlebrown', 'steelblue','darkmagenta','darkolivegreen','paleturquoise',
                         'violet') # these are all module colours, thus the whole network

# obtain gene ids
gene.ids <- rownames(logcpm.norm.counts)

# select module genes
inModules <- is.finite(match(module.colours, interesting.modules))
modGenes <- gene.ids[inModules]

# select the corresponding dissTOM based on module genes
modTOM <- dissTOM[inModules, inModules]
dimnames(modTOM) <- list(modGenes, modGenes)

# Export the network into edge and node list files Cytoscape can read
cyt <- exportNetworkToCytoscape(modTOM, 
                                edgeFile = "../results/CytoscapeInput-edges.txt", 
                                nodeFile = "../results/CytoscapeInput-nodes.txt", 
                                weighted = TRUE, 
                                threshold = 0, 
                                nodeNames = modGenes, 
                                nodeAttr = module.colours[inModules]);



# remove the cytoscape network
rm(cyt)

# Also, export the network as graphml format
# use export_network_to_graphml function
source("../scripts/network_export_graphml.R")

# the whole network
entire.network <- export_network_to_graphml(dissTOM, 
                                            filename = "../results/entire_network.graphml",
                                            threshold = 0,
                                            nodeAttr = gene.ids,
                                            max_edge_ratio = 3)
# network modules
# create a data frame with node attributes (cbind would give a matrix,
# which does not support the $ assignment used below)
node.attributes <- data.frame(gene = modGenes, module = module.colours[inModules])

# Add RGB versions of the module colours
node.attributes$colourRGB <- col2hex(node.attributes$module)

modules.network <- export_network_to_graphml(modTOM, 
                                             filename = "../results/modules_network.graphml",
                                             threshold = 0.02,
                                             nodeAttrDataFrame = node.attributes)
Subnetwork of 1399 genes

FIRE (Finding Informative Regulatory Elements)

Module genes and their module labels are required as input by FIRE (Finding Informative Regulatory Elements). Hence we write out text files with this input.

# write out cluster/module genes and their corresponding module labels for use by 
# FIRE (Finding Informative Regulatory Elements)

fire.clusters <- data.frame(gene = modGenes, cluster = module.labels[inModules])

# sort by module labels; FIRE expects the module labels column to start from 0.
fire.clusters <- fire.clusters[order(fire.clusters$cluster), ]

write.table(as.matrix(fire.clusters), file = "../results/tbrucei_FIRE_expression_clusters.txt", 
            quote = FALSE, row.names = FALSE, sep = "\t")

# Also, write out module genes in a text file.
write.table(data.frame(modGenes), file = "../results/tbrucei_module_genes.txt", row.names = FALSE, 
            quote = FALSE, col.names = FALSE)

Prepare the input sequences for FIRE by extracting the module genes’ sequences from T. brucei’s full set of annotated transcripts, using the seqtk toolkit.

 seqtk subseq \
 ../data/tbrucei_annotated_transcripts/TriTrypDB-43_TbruceiTREU927_AnnotatedTranscripts.fasta \
 ../results/tbrucei_module_genes.txt > \
 ../results/tbrucei_module_genes_sequences.fasta

# strip unneeded information from each header, keeping only the transcript id:
# drop everything from the first "|" onwards, then any trailing text after a space
# (the two cuts are chained; run separately, the second would overwrite the first's output)
cut -f1 -d "|" ../results/tbrucei_module_genes_sequences.fasta | cut -f1 -d " " > \
../results/tbrucei_module_genes_sequences_without_delimiter.fasta
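As a quick self-contained check of the header clean-up above (the example header mimics the TriTrypDB style; treat its exact layout as an assumption):

```shell
# Toy fasta record with a pipe-delimited header; the two cuts keep only the
# transcript id, and leave the sequence line untouched.
printf '>Tb927.1.3130:mRNA | organism=Trypanosoma_brucei\nACGT\n' \
  | cut -f1 -d "|" | cut -f1 -d " "
# output:
# >Tb927.1.3130:mRNA
# ACGT
```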

# FIRE commands - FIRE should be run in the directory where all scripts were installed. Therefore,
# the input files should be copied to FIRE's directory (FIRE-x.x/)
#
#check whether input files are OK
perl TOOLS/FIRE_analyse_input_files.pl \
-fastafile tbrucei_module_genes_sequences_without_delimiter.fasta \
-expfile tbrucei_FIRE_expression_clusters.txt

# remove sequences >= 10,000 base pairs as they make FIRE crash
# code - https://www.biostars.org/p/62678/
 cat ../results/tbrucei_module_genes_sequences_without_delimiter.fasta | \
 awk '{y= i++ % 2 ; L[y]=$0; if(y==1 && length(L[1])<=9999) {printf("%s\n%s\n",L[0],L[1]);}}' > \
../results/tbrucei_module_genes_sequences_without_delimiter_filtered.fasta
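Note that the awk filter assumes a strictly two-line-per-record fasta (one header line, then a single sequence line). A self-contained toy check of that behaviour:

```shell
# Build a toy two-line-per-record fasta: one 4 bp record and one 10,000 bp record.
printf '>short\nACGT\n>long\n%s\n' "$(printf 'A%.0s' $(seq 1 10000))" > toy.fasta

# The filter keeps only records whose sequence line is <= 9999 bp,
# so only the >short record is printed.
awk '{y= i++ % 2 ; L[y]=$0; if(y==1 && length(L[1])<=9999) {printf("%s\n%s\n",L[0],L[1]);}}' toy.fasta

rm -f toy.fasta
```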

# recheck input files
perl TOOLS/FIRE_analyse_input_files.pl \
-fastafile tbrucei_module_genes_sequences_without_delimiter_filtered.fasta \
-expfile tbrucei_FIRE_expression_clusters.txt

# output
#Checking the expression file
#Expression file is OK.
#
#Checking the fasta file
#Fasta file is OK (min sequence length = 98, max length = 9917).
#
#Found fasta sequence for 7272 / 7390 identifiers in expression file.
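The "7272 / 7390" count means some expression-file identifiers have no matching fasta record (e.g. the sequences dropped by the length filter). A hypothetical sketch for listing them, using toy files in place of the real inputs:

```shell
# Toy stand-ins for the FIRE expression file and the filtered fasta.
printf 'g1\t0\ng2\t1\ng3\t1\n' > exp.txt
printf '>g1\nAC\n>g3\nGT\n' > seqs.fasta

# Collect sorted identifier lists from each file.
grep '^>' seqs.fasta | sed 's/^>//' | sort > fasta.ids
cut -f1 exp.txt | sort > exp.ids

# Identifiers present in the expression file but missing from the fasta.
comm -23 exp.ids fasta.ids
# -> g2

rm -f exp.txt seqs.fasta fasta.ids exp.ids
```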

# Analyze sequences for motifs
 perl fire.pl --expfiles=tbrucei_FIRE_expression_clusters.txt --exptype=discrete \
 --fastafile_dna=tbrucei_module_genes_sequences_without_delimiter_filtered.fasta  --nodups=1

# create HTML output of the results
perl MORESCRIPTS/makeresultindex.pl tbrucei_FIRE_expression_clusters.txt "T. brucei cluster motifs"